library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gapminder)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ readr   1.1.1
## ✔ tibble  1.4.2     ✔ purrr   0.2.5
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ ggplot2 3.0.0     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Part 1: Factor Management

  1. Elaboration of the gapminder dataset - Drop Oceania

Filter the Gapminder data to remove observations associated with the continent of Oceania. To start, I will examine the gapminder dataset to explore the continent variable and its Oceania level.

“Oceania” is one of the levels of the continent variable, and continent is a factor with 5 levels. Remember that we can think of a factor as a vector that stores integer codes together with a fixed set of labels (the levels).

levels(gapminder$continent) # gives us all the levels of the continent factor variable
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
class(gapminder$continent)  # tells us the type of variable that continent is
## [1] "factor"
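To make the "integer codes plus labels" idea concrete, here is a minimal sketch on a toy factor. The values are made up for illustration and are not taken from gapminder.

```r
# A toy factor (hypothetical values, not from gapminder)
f <- factor(c("Asia", "Africa", "Asia"))

levels(f)      # the labels, sorted alphabetically by default
as.integer(f)  # the integer codes that index into levels(f)
typeof(f)      # the underlying storage type of a factor
```

Because the levels are stored separately from the codes, removing every row that uses a level does not remove the level itself.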

Now we can filter out the rows where continent is “Oceania”. If we compare the original ‘gapminder’ dataset and the ‘gapNoOceania’ dataset, we can see that our code has worked and the 24 Oceania observations have been removed.

gapNoOceania <- gapminder %>% 
  filter(!(continent == "Oceania")) # the ! filters out everything in the brackets

summary(gapminder$continent) # shows us that initially Oceania has 24 observations
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24
summary(gapNoOceania$continent) # shows us that our code worked and those 24 obvs in Oceania have been removed.
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360        0

HOWEVER! If we look at the levels within the variables, we will see that the Oceania level still remains. Let’s take a look.

levels(gapminder$continent) # shows the levels of the continent variable -> Oceania is still there (as it should be)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
levels(gapNoOceania$continent) # shows the levels of the continent variable -> Oceania is still there even though it shouldn't be
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

So let’s remove the unused factor levels, and double check our data to make sure we have 24 fewer rows in the gapNoOceania dataset.

gapminder %>% 
  str() # Continent has 5 levels and 1704 observations
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
gapNoOceania %>% 
  droplevels() %>% 
  str() # Continent has 4 levels and 1680 observations
## Classes 'tbl_df', 'tbl' and 'data.frame':    1680 obs. of  6 variables:
##  $ country  : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
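Note that the pipeline above only prints the result of droplevels(), so gapNoOceania itself still carries the Oceania level. A minimal sketch of keeping the change, assuming the new object names gapDropped and gapContinentOnly (both are just illustrative names):

```r
library(dplyr)
library(forcats)
library(gapminder)

# droplevels() removes unused levels from every factor column;
# assigning the result is what makes the change stick
gapDropped <- gapminder %>%
  filter(continent != "Oceania") %>%
  droplevels()

nlevels(gapDropped$continent)  # number of continent levels after dropping
nrow(gapDropped)               # number of rows left

# fct_drop() works on a single factor, so other factor columns are untouched
gapContinentOnly <- gapminder %>%
  filter(continent != "Oceania") %>%
  mutate(continent = fct_drop(continent))

nlevels(gapContinentOnly$country)  # country still has its original levels
```

droplevels() is the blunt tool for a whole data frame, while fct_drop() is handy when only one factor should be cleaned up.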

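Make a tibble with one row per year and columns for life expectancy for two or more countries. This code creates a data frame with every country, the year, and the mean life expectancy for that country and year. Since gapminder has exactly one observation per country and year, the mean is simply the recorded value.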

gapminder %>% 
  group_by(country, year) %>% 
  summarize(mlifeExp = mean(lifeExp)) # mean life expectancy
## # A tibble: 1,704 x 3
## # Groups:   country [?]
##    country      year mlifeExp
##    <fct>       <int>    <dbl>
##  1 Afghanistan  1952     28.8
##  2 Afghanistan  1957     30.3
##  3 Afghanistan  1962     32.0
##  4 Afghanistan  1967     34.0
##  5 Afghanistan  1972     36.1
##  6 Afghanistan  1977     38.4
##  7 Afghanistan  1982     39.9
##  8 Afghanistan  1987     40.8
##  9 Afghanistan  1992     41.7
## 10 Afghanistan  1997     41.8
## # ... with 1,694 more rows
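
The task asks for one row per year with one column per country, which is a reshape rather than a summary. Here is a minimal sketch, assuming Canada and Japan as the two countries (my pick for illustration) and using tidyr::spread(), which is available in the tidyr 0.8.1 loaded above. In tidyr 1.0.0 and later, pivot_wider() is the recommended replacement.

```r
library(dplyr)
library(tidyr)
library(gapminder)

lifeExpWide <- gapminder %>%
  filter(country %in% c("Canada", "Japan")) %>%  # example countries, chosen arbitrarily
  select(year, country, lifeExp) %>%
  spread(key = country, value = lifeExp)         # one column per country

lifeExpWide
```

Each row is now a year, and the life expectancies for the two countries sit side by side, which makes them easy to compare or plot against each other.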

  2. Elaboration of the Gapminder Dataset - Reorder Levels

Reorder the levels of country or continent. Use the forcats package to change the order of the factor levels, based on a principled summary of one of the quantitative variables. Consider experimenting with a summary statistic beyond the most basic choice of the median.

So normally R just orders the levels alphabetically.

levels(gapminder$country)
##   [1] "Afghanistan"              "Albania"                 
##   [3] "Algeria"                  "Angola"                  
##   [5] "Argentina"                "Australia"               
##   [7] "Austria"                  "Bahrain"                 
##   [9] "Bangladesh"               "Belgium"                 
##  [11] "Benin"                    "Bolivia"                 
##  [13] "Bosnia and Herzegovina"   "Botswana"                
##  [15] "Brazil"                   "Bulgaria"                
##  [17] "Burkina Faso"             "Burundi"                 
##  [19] "Cambodia"                 "Cameroon"                
##  [21] "Canada"                   "Central African Republic"
##  [23] "Chad"                     "Chile"                   
##  [25] "China"                    "Colombia"                
##  [27] "Comoros"                  "Congo, Dem. Rep."        
##  [29] "Congo, Rep."              "Costa Rica"              
##  [31] "Cote d'Ivoire"            "Croatia"                 
##  [33] "Cuba"                     "Czech Republic"          
##  [35] "Denmark"                  "Djibouti"                
##  [37] "Dominican Republic"       "Ecuador"                 
##  [39] "Egypt"                    "El Salvador"             
##  [41] "Equatorial Guinea"        "Eritrea"                 
##  [43] "Ethiopia"                 "Finland"                 
##  [45] "France"                   "Gabon"                   
##  [47] "Gambia"                   "Germany"                 
##  [49] "Ghana"                    "Greece"                  
##  [51] "Guatemala"                "Guinea"                  
##  [53] "Guinea-Bissau"            "Haiti"                   
##  [55] "Honduras"                 "Hong Kong, China"        
##  [57] "Hungary"                  "Iceland"                 
##  [59] "India"                    "Indonesia"               
##  [61] "Iran"                     "Iraq"                    
##  [63] "Ireland"                  "Israel"                  
##  [65] "Italy"                    "Jamaica"                 
##  [67] "Japan"                    "Jordan"                  
##  [69] "Kenya"                    "Korea, Dem. Rep."        
##  [71] "Korea, Rep."              "Kuwait"                  
##  [73] "Lebanon"                  "Lesotho"                 
##  [75] "Liberia"                  "Libya"                   
##  [77] "Madagascar"               "Malawi"                  
##  [79] "Malaysia"                 "Mali"                    
##  [81] "Mauritania"               "Mauritius"               
##  [83] "Mexico"                   "Mongolia"                
##  [85] "Montenegro"               "Morocco"                 
##  [87] "Mozambique"               "Myanmar"                 
##  [89] "Namibia"                  "Nepal"                   
##  [91] "Netherlands"              "New Zealand"             
##  [93] "Nicaragua"                "Niger"                   
##  [95] "Nigeria"                  "Norway"                  
##  [97] "Oman"                     "Pakistan"                
##  [99] "Panama"                   "Paraguay"                
## [101] "Peru"                     "Philippines"             
## [103] "Poland"                   "Portugal"                
## [105] "Puerto Rico"              "Reunion"                 
## [107] "Romania"                  "Rwanda"                  
## [109] "Sao Tome and Principe"    "Saudi Arabia"            
## [111] "Senegal"                  "Serbia"                  
## [113] "Sierra Leone"             "Singapore"               
## [115] "Slovak Republic"          "Slovenia"                
## [117] "Somalia"                  "South Africa"            
## [119] "Spain"                    "Sri Lanka"               
## [121] "Sudan"                    "Swaziland"               
## [123] "Sweden"                   "Switzerland"             
## [125] "Syria"                    "Taiwan"                  
## [127] "Tanzania"                 "Thailand"                
## [129] "Togo"                     "Trinidad and Tobago"     
## [131] "Tunisia"                  "Turkey"                  
## [133] "Uganda"                   "United Kingdom"          
## [135] "United States"            "Uruguay"                 
## [137] "Venezuela"                "Vietnam"                 
## [139] "West Bank and Gaza"       "Yemen, Rep."             
## [141] "Zambia"                   "Zimbabwe"
levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

But there are many ways we could order the levels. Let’s start with the number of observations in each level.

cont <- gapminder$continent

cont %>% 
  fct_infreq() %>% 
  levels() # This gives us the continent levels in descending order by count: "Africa"   "Asia"     "Europe"   "Americas" "Oceania" 
## [1] "Africa"   "Asia"     "Europe"   "Americas" "Oceania"
cont %>% 
  fct_infreq() %>% # instead of alphabetical order we now have descending order by count
  qplot() # this will plot them

summary(cont) # this gives you the accompanying counts
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24

Let’s change the order to reflect life expectancy using forcats. By default, fct_reorder() orders the levels by the median of the second variable.

gapminder %>%
  mutate(continent = fct_reorder(continent, lifeExp)) %>% 
  ggplot(aes(lifeExp, continent)) + geom_point()
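
The task suggests trying a summary other than the median. A minimal sketch using the .fun and .desc arguments of fct_reorder(), with the minimum and maximum chosen here purely as examples:

```r
library(forcats)
library(gapminder)

# Default: levels ordered by median life expectancy (ascending)
byMedian <- fct_reorder(gapminder$continent, gapminder$lifeExp)
levels(byMedian)

# Order by the lowest life expectancy ever recorded in each continent
byMin <- fct_reorder(gapminder$continent, gapminder$lifeExp, .fun = min)
levels(byMin)

# Order by the highest recorded value, largest first
byMaxDesc <- fct_reorder(gapminder$continent, gapminder$lifeExp,
                         .fun = max, .desc = TRUE)
levels(byMaxDesc)
```

Any function that reduces a numeric vector to a single number can be passed as .fun, so the ordering can follow whatever summary matters for the plot.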

Part 2: File I/O

Experiment with one or more of write_csv()/read_csv() (and/or TSV friends), saveRDS()/readRDS(), dput()/dget(). Create something new, probably by filtering or grouped-summarization of Singer or Gapminder. I highly recommend you fiddle with the factor levels, i.e. make them non-alphabetical (see previous section). Explore whether this survives the round trip of writing to file then reading back in.

Let’s start by changing the level order of the gapminder continent variable.

gapNew <- gapminder %>% 
          mutate(continent = fct_infreq(continent))

levels(gapminder$continent)    
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
levels(gapNew$continent)
## [1] "Africa"   "Asia"     "Europe"   "Americas" "Oceania"

Now let’s write a .csv with the gapNew dataset.

write.csv(gapNew, file = "gapNew.csv")

Now let’s bring it back.

gapNewNew <- read.csv("gapNew.csv")

Now let’s see what the levels are for continent.

levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
levels(gapNewNew$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
levels(gapNew$continent)
## [1] "Africa"   "Asia"     "Europe"   "Americas" "Oceania"

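Nope. The level change didn’t survive the round trip. A CSV file stores only plain text, so the custom level order is lost on import and has to be reapplied every time the file is read :(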
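
For comparison, here is a minimal sketch of the same round trip with saveRDS()/readRDS(), which store the R object itself, followed by one way to restore the order after a CSV read. The temporary file paths and the names gapFromRds and gapFromCsv are placeholders for illustration. Note that in R 4.0.0 and later, read.csv() returns character columns by default, so continent would not come back as a factor at all.

```r
library(dplyr)
library(forcats)
library(gapminder)

gapNew <- gapminder %>%
  mutate(continent = fct_infreq(continent))

# RDS round trip: the factor, including its level order, is stored as-is
rdsPath <- tempfile(fileext = ".rds")  # placeholder path
saveRDS(gapNew, rdsPath)
gapFromRds <- readRDS(rdsPath)
identical(levels(gapFromRds$continent), levels(gapNew$continent))

# CSV round trip: read as text, then rebuild the factor with the wanted order
csvPath <- tempfile(fileext = ".csv")  # placeholder path
write.csv(gapNew, csvPath, row.names = FALSE)
gapFromCsv <- read.csv(csvPath, stringsAsFactors = FALSE)
gapFromCsv$continent <- factor(gapFromCsv$continent,
                               levels = levels(gapNew$continent))
levels(gapFromCsv$continent)
```

The RDS file is only readable from R, so CSV is still the better choice for sharing data, as long as the level order is set again after import.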

Part 3: Visualization design

Remake at least one figure or create a new one, in light of something you learned in the recent class meetings about visualization design and color. Maybe juxtapose your first attempt and what you obtained after some time spent working on it. Reflect on the differences. If using Gapminder, you can use the country or continent color scheme that ships with Gapminder. Consult the dimensions listed in All the Graph Things.

Then, make a new graph by converting this visual (or another, if you’d like) to a plotly graph. What are some things that plotly makes possible, that are not possible with a regular ggplot2 graph?

One of the things that plotly does really well is 3D graphs, and ggplot2 has no built-in 3D geoms. So let’s explore the differences. Let’s say we wanted to examine the relationship between population, life expectancy, and GDP per capita. In ggplot2, we would have to make multiple scatterplots, or map the third variable to an aesthetic such as size or colour.

LifeGDP <- ggplot(gapminder, aes(x=gdpPercap, y=lifeExp)) +
  geom_point() +
  xlab("GDP Per Cap") + 
  ylab("Life Expectancy") + 
  ggtitle("Life Expectancy and GDP")

LifeGDP

ggplot(gapminder, aes(x=gdpPercap, y=pop)) +
  geom_point() +
  xlab("GDP Per Cap") + 
  ylab("Population") + 
  ggtitle("Population and GDP")

In plotly, we can plot these all into the same 3D plot.

plot_ly(gapminder, 
        x = ~gdpPercap, 
        y = ~lifeExp, 
        z = ~pop,
        type = "scatter3d",
        mode = "markers",
        opacity = 0.2)
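
The task also mentions converting an existing ggplot2 figure to plotly. A minimal sketch with ggplotly(), assuming a new plot object p that maps population to point size and continent to colour (these aesthetic choices and the log scale are mine, not from the figures above):

```r
library(ggplot2)
library(plotly)
library(gapminder)

# A static ggplot2 scatterplot with the third variable mapped to size
p <- ggplot(gapminder,
            aes(x = gdpPercap, y = lifeExp, size = pop, colour = continent)) +
  geom_point(alpha = 0.3) +
  scale_x_log10() +
  labs(x = "GDP per capita (log scale)", y = "Life Expectancy")

# Convert the ggplot object into an interactive plotly widget
pInteractive <- ggplotly(p)
pInteractive
```

The converted figure supports hovering over points to read their values, zooming, and toggling continents from the legend, none of which a static ggplot2 image can do.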

Part 4: Writing figures to file

Use ggsave() to explicitly save a plot to file. Then use ![Alt text](/path/to/img.png) to load and embed it in your report. You can play around with various options, such as:

Arguments of ggsave(), such as width, height, resolution or text scaling. Various graphics devices, e.g. a vector vs. raster format. Explicit provision of the plot object p via ggsave(…, plot = p). Show a situation in which this actually matters.

ggsave("LifeGDPplot.png", LifeGDP,
      width = 5, height = 4, dpi = 300, units = "in", device='png')

![Life Expectancy and GDP](LifeGDPplot.png)
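
Here is a situation where passing the plot explicitly matters. When the plot argument is left out, ggsave() saves whatever last_plot() returns, which may not be the figure you meant. A minimal sketch, assuming two illustrative plot objects, lifePlot and popPlot, and a temporary output path:

```r
library(ggplot2)
library(gapminder)

lifePlot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  ggtitle("Life Expectancy and GDP")

popPlot <- ggplot(gapminder, aes(x = gdpPercap, y = pop)) +
  geom_point() +
  ggtitle("Population and GDP")

print(popPlot)  # popPlot is now the most recently displayed plot

# Without plot = lifePlot, this call would save popPlot instead
savePath <- tempfile(fileext = ".png")  # placeholder path
ggsave(savePath, plot = lifePlot,
       width = 5, height = 4, units = "in", dpi = 300)
```

Naming the plot object in the call keeps the saved file tied to the intended figure, no matter what was drawn in between.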